Abstract. We consider the problem of approximating the majority depth (Liu and Singh,
1993) of a point $q$ with respect to an $n$-point set, $S$, by random sampling. At the heart
of this problem is a data structures question: How can we preprocess a set of $n$ lines so
that we can quickly test whether a randomly selected vertex in the arrangement of these
lines is above or below the median level? We describe a Monte-Carlo data structure for
this problem that can be constructed in $O(n \log n)$ time and can answer queries in
$O((\log n)^{4/3})$ expected time.

1 Introduction



A data depth measure quantifies the centrality of an individual (a point) with respect to a
population (a point set). Depth measures are an important part of multivariate statistics
and many have been defined, including Tukey depth [17], Oja depth [13], simplicial depth [10],
majority depth [11], and zonoid depth [8]. For an overview of data depth from a statistician's
point of view, refer to the survey by Small [15]. For a computational geometer's point of
view, refer to Aloupis' survey [1].

In this paper, we focus on the bivariate majority depth measure. Let $S$ be a set of
$n$ points in $\mathbb{R}^2$. For a pair $x, y \in S$, the major side of $x, y$ is the union of the (at most 2)
closed halfplanes having both $x$ and $y$ on their boundary that contain at least $n/2$ points
of $S$. The majority depth [11, 14] of a point, $q$, with respect to $S$, is defined as the number
of pairs $x, y \in S$ that have $q$ in their major side.
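
As a concrete reference point, the definition can be evaluated directly by brute force in cubic time. The following Python sketch is only an illustration of the definition, not one of the algorithms discussed in this paper; the names `cross` and `majority_depth` are ours, and points are assumed to be tuples of integer coordinates.

```python
from itertools import combinations

def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b); positive if b is left of the ray o -> a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def majority_depth(q, S):
    """Number of pairs {x, y} of S whose major side contains q (brute force)."""
    n = len(S)
    depth = 0
    for x, y in combinations(S, 2):
        # Sizes of the two closed halfplanes bounded by the line through x and y.
        left = sum(1 for p in S if cross(x, y, p) >= 0)
        right = sum(1 for p in S if cross(x, y, p) <= 0)
        side = cross(x, y, q)
        # q is in the major side if it lies in a closed halfplane with >= n/2 points.
        in_left_major = 2 * left >= n and side >= 0
        in_right_major = 2 * right >= n and side <= 0
        if in_left_major or in_right_major:
            depth += 1
    return depth
```

Because both closed halfplanes always contain $x$ and $y$ themselves, the definition only becomes interesting for $n \ge 5$.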

Under the usual projective duality [9], the set $S$ becomes a set, $S^*$, of lines; pairs of
points in $S$ become vertices in the arrangement, $A(S^*)$, of $S^*$; and $q$ becomes a line, $q^*$.
The median level of $A(S^*)$ is the closure of the set of points on lines in $S^*$ that have exactly
$\lfloor n/2 \rfloor$ lines of $S^*$ above them. Then the majority depth of $q$ with respect to $S$ is equal to
the number of vertices, $x$, in $A(S^*)$ such that

1. x is above q* and x is above the median level; or 

2. x is below q* and x is below the median level. 
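
The duality transform is not spelled out in the text. One common convention, which is an assumption of the sketch below and not necessarily the one used in [9], maps the point $(a, b)$ to the line $y = ax - b$ and the line $y = mx + c$ to the point $(m, -c)$. Under this convention the pair $x, y$ maps to the vertex where $x^*$ and $y^*$ meet, and vertical above/below relations carry over to the dual. All function names are illustrative.

```python
from fractions import Fraction

def dual_line(p):
    """Dual of the point (a, b): the line y = a*x - b, stored as (slope, intercept)."""
    a, b = p
    return (Fraction(a), Fraction(-b))

def line_through(x, y):
    """The non-vertical line through points x and y, stored as (slope, intercept)."""
    slope = Fraction(y[1] - x[1], y[0] - x[0])
    return (slope, x[1] - slope * x[0])

def dual_point(line):
    """Dual of the line y = m*x + c: the point (m, -c)."""
    m, c = line
    return (m, -c)

def height_above(point, line):
    """Signed vertical distance from the line up to the point (> 0 means above)."""
    m, c = line
    return point[1] - (m * point[0] + c)

x, y, q = (1, 2), (3, 8), (0, 5)
v = dual_point(line_through(x, y))  # vertex of the dual arrangement for the pair x, y

# v lies on both dual lines x* and y*, so it is a vertex of the arrangement.
on_both = height_above(v, dual_line(x)) == 0 and height_above(v, dual_line(y)) == 0
# q is above the line through x and y exactly when v is above q*.
same_side = (height_above(q, line_through(x, y)) > 0) == (height_above(v, dual_line(q)) > 0)
```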

Chen and Morin [5] present an algorithm for computing majority depth that works in 
the dual. Their algorithm works by computing the median level, computing the intersections 
of q* with the median level, and using fast inversion counting to determine the number, t, 
of vertices of the arrangement sandwiched between q* and the median level. The majority 
depth of q is then equal to $\binom{n}{2} - t$. The running time of this algorithm is within a logarithmic
factor of m, the complexity of the median level. 

The maximum complexity of the median level of $n$ lines has been the subject of
intense study since it was first posed. The current best upper bound is $O(n^{4/3})$, due to
Dey [7], and the current best lower bound is $2^{\Omega(\sqrt{\log n})} n$, due to Tóth [16]. The median level
can be computed in time $O(\min\{m \log n, n^{4/3}\})$ [2, 3]. Thus, the worst-case running time
of Chen and Morin's majority depth algorithm is $\omega(n (\log n)^c)$ for any constant $c$, but no
worse than $O(n^{4/3} \log n)$.

It seems difficult for any algorithm that computes the exact majority depth of a 
point to avoid (at least implicitly) computing the median level of A(S*). This motivates 
approximation by random sampling. In particular, one can use the simple technique of 
sampling vertices of A(S*) and checking whether 

1. each sample lies above or below q*; and 

2. each sample lies above or below the median level of S*.

In the primal, this is equivalent to taking random pairs of points in $S$ and checking, for
each such pair, $(x, y)$, whether (1) the closed upper halfplane, $h_{xy}$, with $x$ and $y$ on its boundary,
contains $q$ and (2) $h_{xy}$ contains $n/2$ or more points of $S$.

The former test takes constant time but the latter test leads to a data structuring
problem: Preprocess the set $S^*$ so that one can quickly test, for any query point, $x$, whether
$x$ is above or below the median level of $A(S^*)$. (Equivalently, does a query halfplane, $h$,
contain $n/2$ or more points of $S$?) We know of two immediate solutions to this problem. The
first solution is to compute the median level explicitly, in $O(\min\{m \log n, n^{4/3}\})$ time, after
which any query can be answered in $O(\log n)$ time by binary search on the x-coordinate of
$x$. The second solution is to construct a halfspace range counting structure (a partition
tree) in $O(n \log n)$ time that can count the number of points of $S$ in $h$ in $O(n^{1/2})$ time.

The first solution is not terribly good, since Chen and Morin's algorithm shows that 
computing the exact majority depth of q can be done in time that is within a logarithmic 
factor of m, the complexity of the median level. (Though if the goal is to preprocess in 
order to approximate the majority depth for many different points, then this method may 
be the right choice.) 

In this paper, we show that the second solution can be improved considerably, at
least when the application is approximating majority depth. In particular, we show that
when the query point is a randomly chosen vertex of the arrangement $A(S^*)$, a careful
combination of partition trees [4] and $\varepsilon$-approximations [12] can be used to answer queries
in $O((\log n)^{4/3})$ expected time. This faster query time means that we can use more random
samples, which leads to a more accurate approximation.

The remainder of this paper is organized as follows. In Section 2 we review results
on range counting and $\varepsilon$-approximations and show how they can be used for approximate
range counting. In Section 3 we show how these approximate range counting results can be
used to quickly answer queries about whether a random vertex of $A(S^*)$ is above or below the
median level of $S^*$. In Section 4 we briefly mention how all of this applies to the problem
of approximating majority depth. Finally, Section 5 concludes with an open problem.

2 Approximate Range Counting 

In this section, we consider the problem of approximate range counting. That is, we study
algorithms to preprocess $S$ so that, given a closed halfplane $h$ and an integer $i \ge 0$, we can
quickly return an integer $r_i(h, S)$ such that

$$\bigl| |h \cap S| - r_i(h, S) \bigr| \le i .$$

This data structure is such that queries are faster when the allowable error, $i$, is larger.

There are no new results in this section. Rather, it is a review of two relevant
results on range searching and $\varepsilon$-approximations that are closely related, but separated by
nearly 20 years. The reason we do this is that, without a guide, it can take some effort to
gather and assemble the pieces; some of the proofs are existential, some are stated in terms
of discrepancy theory, and some are stated in terms of VC-dimension. The reader who
already knows all this, or is uninterested in learning it, should skip directly to Lemma 2.

The first tools we need come from a recent result of Chan on optimal partition trees
and their application to exact halfspace range counting [4, Theorems 3.2 and 5.3, with help
from Theorem 5.2]:

Theorem 1. Let $S$ be a set of $n$ points in $\mathbb{R}^2$ and let $N > n$ be an integer. There exists a
data structure that can preprocess $S$ in $O(n \log N)$ expected time so that, with probability at
least $1 - 1/N$, for any query halfplane, $h$, the data structure can return $|h \cap S|$ in $O(n^{1/2})$
time.

We say that a halfplane, $h$, crosses a set, $X$, of points if $h$ neither contains $X$ nor is
disjoint from $X$. The partition tree of Theorem 1 is actually a binary space partition tree.
Each internal node, $u$, is a subset of $\mathbb{R}^2$ and the two children of a node are the subsets of $u$
obtained by cutting $u$ with a line. Each leaf, $w$, in this tree has $|w \cap S| \le 1$. The $O(n^{1/2})$
query time is obtained by designing this tree so that, with probability at least $1 - 1/N$,
there are only $O(n^{1/2})$ nodes crossed by any halfplane.

For a geometric graph $G = (S, E)$, the crossing number of $G$ is the maximum, over
all halfplanes, $h$, of the number of edges $uw \in E$ such that $h$ crosses $\{u, w\}$. From Theorem 1
it is easy to derive a spanning tree of $S$ with crossing number $O(n^{1/2})$ using a bottom-up
algorithm: Perform a post-order traversal of the partition tree. When processing a node $u$
with children $v$ and $w$, add an edge to the tree that joins an arbitrary point in $v \cap S$ to an
arbitrary point in $w \cap S$. Since a halfplane cannot cross any edge unless it also crosses the
node at which the edge was created, this yields the following result [4, Corollary 7.1]:

Theorem 2. For any $n$-point set, $S$, and any $N > n$, it is possible to compute, in $O(n \log N)$
expected time, a spanning tree, $T$, of $S$ that, with probability at least $1 - 1/N$, has crossing
number $O(n^{1/2})$.

A spanning tree is not quite what is needed for what follows. Rather, we require a
matching of size $\lfloor n/2 \rfloor$. To obtain this, we first convert the tree, $T$, from Theorem 2 into a
path, $P$, that contains the vertices of $T$ in the order they are discovered
by a depth-first traversal. It is easy to verify that the crossing number of $P$ is at most twice
the crossing number of $T$. Next, we take every second edge of $P$ to obtain a matching:

Corollary 1. For any $n$-point set, $S$, and any $N > n$, it is possible to compute, in
$O(n \log N)$ expected time, a matching, $M$, of $S$ of size $\lfloor n/2 \rfloor$ that, with probability at least
$1 - 1/N$, has crossing number $O(n^{1/2})$.
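
The tree-to-path-to-matching conversion is simple enough to sketch in a few lines of Python. This is a minimal illustration that assumes the spanning tree is already available as an adjacency dictionary; constructing a tree with low crossing number (Theorem 2) is not attempted here, and the name `matching_from_tree` is ours.

```python
def matching_from_tree(adj, root):
    """Turn a spanning tree into a matching of floor(n/2) edges.

    adj maps each vertex to the list of its neighbours in the tree.
    The vertices are listed in depth-first discovery order (the path P),
    and every second edge of that path is kept.
    """
    order, seen, stack = [], {root}, [root]
    while stack:
        u = stack.pop()
        order.append(u)
        # Push children in reverse so they are discovered in adjacency order.
        for v in reversed(adj[u]):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    # Edges (order[0], order[1]), (order[2], order[3]), ... of the path P.
    return [(order[k], order[k + 1]) for k in range(0, len(order) - 1, 2)]
```

For the tree with edges 0-1, 0-2, 1-3, 1-4 rooted at 0, the discovery order is 0, 1, 3, 4, 2, so the matching consists of the pairs (0, 1) and (3, 4).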

The following argument is due to Matoušek, Welzl and Wernisch [12, Lemma 2.5].
Assume, for simplicity, that $n$ is even and let $S' \subset S$ be obtained by taking exactly one
endpoint from each edge in the matching $M$ obtained by Corollary 1. Consider some
halfplane $h$, and let $M_h^{\mathrm{in}}$ be the subset of the edges of $M$ contained in $h$ and let $M_h^{\times}$ be the
subset of edges crossed by $h$. Then

$$|h \cap S| = 2|M_h^{\mathrm{in}}| + |M_h^{\times}| .$$

In particular,

$$|h \cap S| - |M_h^{\times}| \le 2|h \cap S'| \le |h \cap S| + |M_h^{\times}| .$$

Since $|M_h^{\times}| \in O(n^{1/2})$, this is good news in terms of approximate range counting; the set
$S'$ is half the size of $S$, but $2|h \cap S'|$ gives an estimate of $|h \cap S|$ that is off by only $O(n^{1/2})$.
Next we show that this can be improved considerably with almost no effort.

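Rather than choose an arbitrary endpoint of each edge in $M$ to take part in $S'$, we
choose one of the two endpoints at random (and independently of the other $n/2 - 1$
choices). Then, each edge crossed by $h$ (each edge in $M_h^{\times}$) has probability 1/2 of
contributing a point to $h \cap S'$ and each edge contained in $h$ (each edge in $M_h^{\mathrm{in}}$)
contributes exactly one point to $h \cap S'$. Therefore,

$$\mathrm{E}\bigl[2|h \cap S'|\bigr] = 2\bigl(|M_h^{\mathrm{in}}| + \tfrac{1}{2}|M_h^{\times}|\bigr) = |h \cap S| .$$

That is, $2|h \cap S'|$ is an unbiased estimator of $|h \cap S|$. Even better: the error of this estimator
is (2 times) the deviation from its mean of a binomial($|M_h^{\times}|$, 1/2) random variable, with
$|M_h^{\times}| \in O(n^{1/2})$. Using standard results on the concentration of binomial random
variables (e.g., Chernoff bounds [6]), we immediately obtain:

$$\Pr\bigl\{ \bigl| 2|h \cap S'| - |h \cap S| \bigr| \ge c\, n^{1/4} (\log N)^{1/2} \bigr\} \le 1/N ,$$

for some constant $c > 0$. That is, with probability $1 - 1/N$, $2|h \cap S'|$ estimates $|h \cap S|$ to
within an error of $O(n^{1/4} (\log N)^{1/2})$. Since $S$ defines only $O(n^2)$ combinatorially distinct
halfplanes and $N > n$, increasing $c$ by a constant factor makes this hold for all halfplanes
simultaneously. Putting everything together, we obtain: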

Lemma 1. For any $n$-point set, $S$, and any $N > n$, it is possible to compute, in $O(n \log N)$
expected time, a subset $S'$ of $S$ of size $\lceil n/2 \rceil$ such that, with probability at least $1 - 1/N$,
for every halfplane $h$,

$$\bigl| 2|h \cap S'| - |h \cap S| \bigr| \in O\bigl(n^{1/4} (\log N)^{1/2}\bigr) .$$
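
The unbiasedness of the estimate $2|h \cap S'|$ can be checked exhaustively on a toy instance by averaging over every way of picking one endpoint per matching edge. The Python sketch below does exactly that. The halfplane encoding `(a, b, c)` for $ax + by \le c$ and the function names are assumptions made for this illustration, and the matching used in the example is arbitrary rather than one with low crossing number.

```python
from itertools import product

def in_halfplane(p, h):
    """h = (a, b, c) encodes the closed halfplane a*x + b*y <= c."""
    a, b, c = h
    return a * p[0] + b * p[1] <= c

def expected_estimate(matching, h):
    """Average of 2*|h ∩ S'| over all 2^|M| ways of picking one endpoint per edge."""
    choices = list(product((0, 1), repeat=len(matching)))
    total = 0
    for pick in choices:
        sample = [edge[side] for edge, side in zip(matching, pick)]
        total += 2 * sum(1 for p in sample if in_halfplane(p, h))
    return total / len(choices)

def true_count(matching, h):
    """|h ∩ S| where S is the set of all endpoints of the matching."""
    return sum(1 for edge in matching for p in edge if in_halfplane(p, h))
```

Only the edges crossed by the halfplane contribute any randomness, which is why the error concentrates like a binomial on $|M_h^{\times}|$ trials rather than on $n/2$ trials.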

What follows is another argument by Matoušek, Welzl and Wernisch [12, Lemma 2.2].
By repeatedly applying Lemma 1 we obtain a sequence of $O(\log n)$ sets $S_0 \supset S_1 \supset \cdots \supset S_r$, where
$S_0 = S$ and $|S_j| = \lceil n/2^j \rceil$. For $j \ge 1$, the set $S_j$ can be computed from $S_{j-1}$ in
$O(2^{-j} n \log N)$ time and has the property that, with probability at least $1 - 1/N$, for every
halfplane $h$,

$$\bigl| 2^j |h \cap S_j| - |h \cap S| \bigr| \in O\bigl(2^{3j/4} n^{1/4} (\log N)^{1/2}\bigr) . \qquad (1)$$

At this point, we have come full circle. We store each of the sets $S_0, \ldots, S_r$ in an optimal
partition tree (Theorem 1) so that we can do range counting queries on each set $S_j$ in
$O(|S_j|^{1/2})$ time. This (finally) gives the result we need on approximate range counting:

Lemma 2. Given any set $S$ of $n$ points in $\mathbb{R}^2$ and any $N > n$, there exists a data structure
that can be constructed in $O(n \log N)$ expected time and, with probability at least $1 - 1/N$,
can, for any halfspace $h$ and any integer $i \in \{0, \ldots, n\}$, return a number $r_i(h, S)$ such that

$$\bigl| |h \cap S| - r_i(h, S) \bigr| \le i .$$

Such a query takes $O(\min\{n^{1/2}, (n/i)^{2/3} (\log N)^{1/3}\})$ expected time.

Proof. The data structure is a sequence of optimal partition trees on the sets $S_0, \ldots, S_r$.
All of these structures can be computed in $O(n \log N)$ time, since $|S_0| = n$ and the size of
each subsequent set decreases by a factor of 2.

To answer a query, $(h, i)$, we proceed as follows: If $i \le n^{1/4}$, then we perform exact
range counting on the set $S_0 = S$ in $O(n^{1/2})$ time to return the value $|h \cap S|$. Otherwise,
we perform range counting on the set $S_j$, where $j$ is the largest value that satisfies

$$C\, 2^{3j/4} n^{1/4} (\log N)^{1/2} \le i ,$$

where the constant $C$ depends on the constant in the big-Oh notation in (1), and return
$2^j |h \cap S_j|$. This means $|S_j| = O((n/i)^{4/3} (\log N)^{2/3})$ and the query takes expected time

$$O(|S_j|^{1/2}) = O((n/i)^{2/3} (\log N)^{1/3}) ,$$

as required. □
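
The choice of level in this proof can be made concrete as follows. This Python sketch only selects the index $j$; the constant `C` (set to 1 by default) and the function name `choose_level` are placeholders chosen for illustration, since the true constant comes from the big-Oh in (1).

```python
import math

def choose_level(n, i, N, C=1.0):
    """Largest j with C * 2**(3j/4) * n**(1/4) * sqrt(log N) <= i (0 means exact counting on S)."""
    if i <= n ** 0.25:
        return 0  # error budget too small: count exactly on S_0 = S
    scale = C * n ** 0.25 * math.sqrt(math.log(N))
    j = 0
    # Move to a smaller sample while the error bound (1) still fits within i
    # and the sample S_{j+1} would still be non-empty.
    while 2 ** (j + 1) <= n and scale * 2 ** (0.75 * (j + 1)) <= i:
        j += 1
    return j
```

Each increment of $j$ halves the set being queried, so a larger error budget $i$ translates directly into a smaller partition tree and a faster query.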

Our main application of Lemma 2 is a test that checks whether a halfspace, $h$,
contains $n/2$ or more points of $S$.

Lemma 3. Given any set $S$ of $n$ points in $\mathbb{R}^2$ and any $N > n$, there exists a data structure
that can be constructed in $O(n \log N)$ expected time and, with probability at least $1 - 1/N$,
can, for any halfspace $h$, determine if $|h \cap S| \ge n/2$. Such a query takes expected time

$$Q(i) = \begin{cases} O(n^{1/2}) & \text{for } 0 \le i \le n^{1/4} \\ O((n/i)^{2/3} (\log N)^{1/3}) & \text{otherwise,} \end{cases}$$

where $i = \bigl| |h \cap S| - n/2 \bigr|$.



Proof. As preprocessing, we construct the data structure of Lemma 2. To perform a query,
we perform a sequence of queries $(h, i_j)$, for $j = 0, 1, 2, \ldots$, where $i_j = n/2^j$. The $j$th such
query takes $O(2^{2j/3} (\log N)^{1/3})$ time and the question, "is $|h \cap S| \ge n/2$?" is resolved once
$n/2^j < i/2$. Since the cost of successive queries is exponentially increasing, this final query
takes time $O(\min\{n^{1/2}, (n/i)^{2/3} (\log N)^{1/3}\})$ and dominates the total query time. □
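
The query strategy in this proof is a doubling search over the error budget. The sketch below illustrates it in Python against an abstract oracle: `approx_count` is a hypothetical callback standing in for the structure of Lemma 2 (it must return a value within `i` of the true count, and the exact count when `i` is 0), and `at_least_half` is our name for the test.

```python
def at_least_half(n, approx_count):
    """Decide whether the true count is >= n/2 using ever more precise approximate counts.

    approx_count(i) must return r with |r - true_count| <= i.
    Returns (answer, list of error budgets that were tried).
    """
    tried = []
    i = n
    while i >= 1:
        tried.append(i)
        r = approx_count(i)
        if r - i >= n / 2:   # even the smallest consistent count is >= n/2
            return True, tried
        if r + i < n / 2:    # even the largest consistent count is < n/2
            return False, tried
        i //= 2              # halve the error budget: i_j = n / 2**j
    tried.append(0)
    return approx_count(0) >= n / 2, tried
```

With $n = 1024$ and a true count of 700, an oracle that always errs low by the full budget is resolved at budget 64, the first budget below half the distance $|700 - 512| = 188$ to the threshold.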
3 Side of Median Level Testing

We are now ready to tackle the main problem that comes up in trying to estimate majority
depth by random sampling: Given a random pair of points $x, y \in S$, determine if there are
$n/2$ or more points in the upper halfspace, $h_{xy}$, whose boundary is the line through $x$
and $y$. In this section, though, it will be more natural to work in the dual setting. Here
the question becomes: Given a random vertex, $x$, of $A(S^*)$, determine whether $x$ is above
or below the median level of $S^*$. The data structure in Lemma 3 answers these queries in
time $O((n/i)^{2/3} (\log N)^{1/3})$ when the vertex $x$ is on the $n/2 - i$ or $n/2 + i$ level.

Before proving our main theorem, we recall a result of Dey [7, Theorem 4.2] about
the maximum complexity of a sequence of levels.

Lemma 4. Let $L$ be any set of $n$ lines and let $s$ be the number of vertices of $A(L)$ that are
on levels $k, k+1, \ldots, k+j$. Then, $s \in O(n k^{1/3} j^{2/3})$.

We are interested in the special case of Lemma 4 where $k = n/2 - i$ and $j = 2i$:

Corollary 2. Let $L$ be any set of $n$ lines. Then, for any $i \in \{1, \ldots, n/2\}$, the maximum
total number of vertices of $A(L)$ whose level is in $\{n/2 - i, \ldots, n/2 + i\}$ is $O(n^{4/3} i^{2/3})$.
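
For completeness, the substitution behind Corollary 2 can be written out as follows, reading the bound of Lemma 4 as $s \in O(n k^{1/3} j^{2/3})$ and using $k = n/2 - i \le n$:

```latex
\[
  s \in O\bigl(n\, k^{1/3} j^{2/3}\bigr)
    = O\bigl(n\, (n/2 - i)^{1/3} (2i)^{2/3}\bigr)
    \subseteq O\bigl(n \cdot n^{1/3} \cdot i^{2/3}\bigr)
    = O\bigl(n^{4/3} i^{2/3}\bigr) .
\]
```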

Corollary 2 is useful because it gives bounds on the distribution of the level of a
randomly chosen vertex of $A(S^*)$.

Theorem 3. Given any set, $L$, of $n$ lines, there exists a data structure
that can test if a point $x$ is above or below the median level of $L$. For any constant $c > 0$, this
structure can be made to have the following properties:

1. It can be constructed in $O(n \log n)$ expected time and uses $O(n)$ space;

2. with probability at least $1 - n^{-c}$, it answers correctly for all possible queries; and

3. when given a random vertex of $A(L)$ as a query, the expected query time is $O((\log n)^{4/3})$.

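Proof. The data structure is, of course, the data structure of Lemma 3 with $N = n^c$. Let $n_i$
be the number of vertices of $A(L)$ on levels $n/2 - i$ and $n/2 + i$. Then the expected query
time of this data structure is at most

$$F(n_0, \ldots, n_{n/2}) = \frac{1}{\binom{n}{2}} \sum_{i=0}^{n/2} n_i\, Q(i) , \qquad (2)$$

where, for sufficiently large $n$, $Q(i)$ is upper-bounded by the expression in Lemma 3 (with
$N = n^c$), and where Corollary 2 (together with Dey's $O(n^{4/3})$ bound on the median level
itself, for $j = 0$) implies Dey's constraints

$$\sum_{i=0}^{j} n_i \le \begin{cases} \gamma n^{4/3} & \text{if } j = 0 \\ \gamma n^{4/3} j^{2/3} & \text{otherwise} \end{cases}$$

for some constant $\gamma > 0$ and all $j \in \{0, \ldots, n/2\}$.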

Working in our favour is that $Q(i) \ge Q(i')$ for all $i < i'$. This implies that, to obtain
an upper bound on $F(n_0, \ldots, n_{n/2})$, we can set

$$\sum_{i=0}^{j} n_i = \begin{cases} \gamma n^{4/3} & \text{if } j = 0 \\ \gamma n^{4/3} j^{2/3} & \text{otherwise} \end{cases} \qquad (3)$$

for all $j \in \{0, \ldots, n/2\}$. To see why this is so, suppose we have a sequence $S = n_0, \ldots, n_{n/2}$
that satisfies Dey's constraints but for which $\sum_{i=0}^{j} n_i < \gamma n^{4/3} j^{2/3}$ for some index $j$. If
$j = n/2$ then we can obviously increase the value of $n_j$, still satisfy Dey's constraints and
increase the value of $F(n_0, \ldots, n_{n/2})$. Otherwise ($j \in \{0, \ldots, n/2 - 1\}$), the sequence

$$S' = n_0, \ldots, n_j + \delta, n_{j+1} - \delta, \ldots, n_{n/2} ,$$

where $\delta = \gamma n^{4/3} j^{2/3} - \sum_{i=0}^{j} n_i$, also satisfies Dey's constraints. Furthermore,

$$F(S') - F(S) = \frac{\delta Q(j) - \delta Q(j+1)}{\binom{n}{2}} \ge 0 ,$$

so $F(S') \ge F(S)$. Repeatedly applying this type of modification (or using induction) shows
that the sequence $S = n_0, \ldots, n_{n/2}$ that satisfies (3) is a sequence that maximizes $F(S)$.

Finally, we can bound the sequence that satisfies (3) by differentiating $\gamma n^{4/3} j^{2/3}$
with respect to $j$. This yields $n_i \in O(n^{4/3} / i^{1/3})$ for all $i \in \{1, \ldots, n/2\}$. Plugging this back
into (2) yields

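$$\begin{aligned}
F(n_0, \ldots, n_{n/2}) &\le \frac{1}{\binom{n}{2}} \Bigl( O\bigl(n^{4/3} Q(0)\bigr) + \sum_{i=1}^{n/2} O\bigl(n^{4/3} Q(i) / i^{1/3}\bigr) \Bigr) \\
&\le o(1) \\
&\quad + \frac{1}{\binom{n}{2}} \sum_{i=1}^{n^{1/4}} O\bigl(n^{4/3} n^{1/2} / i^{1/3}\bigr) && (4) \\
&\quad + \frac{1}{\binom{n}{2}} \sum_{i=n^{1/4}+1}^{n/2} O\bigl(n^{4/3} (n/i)^{2/3} (\log N)^{1/3} / i^{1/3}\bigr) . && (5)
\end{aligned}$$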

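Recall that $\int_1^n i^{-1/3}\, di = \tfrac{3}{2}(n^{2/3} - 1)$. Using this integral to bound the sum in (4) allows us
to just squeak by:

$$\begin{aligned}
(4) &= \frac{1}{\binom{n}{2}} \sum_{i=1}^{n^{1/4}} O\bigl(n^{4/3} n^{1/2} / i^{1/3}\bigr) \\
&= \frac{1}{\binom{n}{2}}\, O\bigl(n^{4/3} n^{1/2} (n^{1/4})^{2/3}\bigr) && \text{(bounding by integral)} \\
&= O(1) .
\end{aligned}$$
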
We are not so lucky with the sum in (5), which ends up being harmonic:

$$\begin{aligned}
(5) &= \frac{1}{\binom{n}{2}} \sum_{i=n^{1/4}+1}^{n/2} O\bigl(n^{4/3} (n/i)^{2/3} (\log N)^{1/3} / i^{1/3}\bigr) \\
&= \sum_{i=n^{1/4}+1}^{n/2} O\bigl((\log N)^{1/3} / i\bigr) \\
&= O\bigl((\log n)(\log N)^{1/3}\bigr) && \text{(since } \textstyle\sum_{i=1}^{n} 1/i = O(\log n)\text{)} \\
&= O\bigl((\log n)^{4/3}\bigr) ,
\end{aligned}$$

since $N = n^c$ and $c$ is constant. To summarize, the expected running time of the query
algorithm is at most

$$F(n_0, \ldots, n_{n/2}) \le o(1) + (4) + (5) = O\bigl((\log n)^{4/3}\bigr) . \qquad \Box$$

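As a numerical sanity check on this calculation, one can evaluate the sum (2) for the extremal sequence (3) and compare it against $(\log n)^{4/3}$. The Python sketch below does this under loudly stated assumptions: every hidden constant ($\gamma$, the constants inside $Q$) is set to 1, $Q$ is taken in its $\min\{\cdot,\cdot\}$ form from Lemma 2, and natural logarithms are used. It illustrates the growth rate only and proves nothing.

```python
import math

def expected_query_bound(n, c=1.0, gamma=1.0):
    """Evaluate (2) for the sequence defined by (3), with all hidden constants set to 1."""
    log_N = c * math.log(n)  # N = n**c

    def Q(i):
        if i == 0:
            return math.sqrt(n)
        return min(math.sqrt(n), (n / i) ** (2 / 3) * log_N ** (1 / 3))

    def prefix(j):
        # Right-hand side of (3): the prefix sum n_0 + ... + n_j.
        return gamma * n ** (4 / 3) * (1.0 if j == 0 else j ** (2 / 3))

    total = prefix(0) * Q(0)
    for i in range(1, n // 2 + 1):
        n_i = prefix(i) - prefix(i - 1)  # recover n_i from consecutive prefix sums
        total += n_i * Q(i)
    return total / math.comb(n, 2)

def ratio_to_log_bound(n):
    """Ratio of the evaluated bound to (log n)^(4/3)."""
    return expected_query_bound(n) / math.log(n) ** (4 / 3)
```

For the values of $n$ we would expect to try (say $10^3$ to $10^5$), the ratio stays within a small constant range, consistent with the $O((\log n)^{4/3})$ bound.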
4 Estimating Majority Depth 

Finally, we return to our application, namely estimating majority depth. 

Theorem 4. Given a set $S$ of $n$ points in $\mathbb{R}^2$ and any constant $c > 0$, there exists a data
structure that can preprocess $S$ in $O(n \log n)$ expected time such that, for any point $q$ and
any number of samples $r$, the data structure can compute, in $O(r (\log n)^{4/3})$ expected time,
a value $d'(q, S)$ such that

$$\Pr\left\{ \frac{|d'(q, S) - d(q, S)|}{d(q, S)} \ge \varepsilon \right\} \le \exp\bigl(-\Omega(\varepsilon^2 r p)\bigr) + n^{-c} ,$$

where $d(q, S)$ is the majority depth of $q$ with respect to $S$ and $p = d(q, S) / \binom{n}{2}$ is the
normalized majority depth of $q$.

Proof. The data structure is the one described in Theorem 3. Let $p = d(q, S) / \binom{n}{2}$. Select $r$
random vertices of $A(S^*)$ (by taking random pairs of lines in $S^*$) and, for each sample, test
if it contributes to $d(q, S)$. This yields a count $r' \le r$ where

$$\mathrm{E}[r'] = r p .$$

We then return the value $d'(q, S) = (r'/r) \binom{n}{2}$, so that $\mathrm{E}[d'(q, S)] = d(q, S)$, as required.

To prove the error bound, we use the fact that $r'$ is a binomial($r$, $p$) random variable.
Applying Chernoff bounds [6] on $r'$ yields:

$$\Pr\{ |r' - r p| \ge \varepsilon r p \} \le \exp\bigl(-\Omega(\varepsilon^2 r p)\bigr) .$$

Finally, the algorithm may fail not because of badly chosen samples, but rather because
the data structure of Theorem 3 fails. The probability that this happens is at most $n^{-c}$.
Therefore, the overall result follows from the union bound. □
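
The sampling scheme in this proof can be sketched in the primal as follows. To keep the example self-contained, the side-of-median-level test is done by brute-force counting in $O(n)$ time per sample; this is a stand-in for the data structure of Theorem 3, not an implementation of it. The function names and the `rng` parameter are ours.

```python
import random
from math import comb

def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b); positive if b is left of the ray o -> a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def estimate_majority_depth(q, S, r, rng):
    """Estimate the majority depth of q from r random pairs of points of S."""
    n = len(S)
    hits = 0
    for _ in range(r):
        x, y = rng.sample(S, 2)  # a random pair, i.e. a random vertex of A(S*)
        # Brute-force halfplane counts (stand-in for the Theorem 3 structure).
        left = sum(1 for p in S if cross(x, y, p) >= 0)
        right = sum(1 for p in S if cross(x, y, p) <= 0)
        side = cross(x, y, q)
        if (2 * left >= n and side >= 0) or (2 * right >= n and side <= 0):
            hits += 1  # this sample contributes to d(q, S)
    return hits / r * comb(n, 2)  # d'(q, S) = (r'/r) * C(n, 2)
```

Since the relative error bound depends on $rp$, points of small normalized depth $p$ need proportionally more samples to reach the same accuracy.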
5 Conclusions



Although the estimation of majority depth is the original motivation for studying this problem,
the underlying question of the tradeoffs involved in preprocessing for testing whether
a point is above or below the median level seems a fundamental question that is still far
from answered. In particular, we have no good answer to the following question:

Open Problem 1. What is the fastest linear-space data structure for testing if an arbitrary 
query point is above or below the median level of a set of n lines? 

To the best of our knowledge, the current state of the art is partition trees, which
can only answer these queries in $O(n^{1/2})$ time.